Systolic equalizer and method of using same

ABSTRACT

A method and apparatus provide a systolic equalizer for Viterbi equalization of an 8-PSK signal distorted by passage through a communication channel. The systolic equalizer architecture is scalable to process, as examples, four, eight and 16 state received signals. An equalizer in accordance with this invention includes a logical arrangement of a plurality of instantiations of locally coupled processing elements forming a systolic array for processing in common received signal samples having distortion induced by passage through a communication channel. The equalizer outputs soft values for input to a decoder, the soft values representing an approximation of maximum a posteriori (MAP) probabilities. A trellis search procedure is employed to reconstruct estimates of a received signal sequence based on a reduced number of states. The reduced number of states is represented by a plurality of groups determined by partitioning a symbol constellation such that there are fewer groups than possible symbols.

CLAIM OF PRIORITY FROM COPENDING PATENT APPLICATION

This patent application claims priority under 35 U.S.C. §119(e) from Provisional Patent Application No. 60/349,306, filed Oct. 26, 2001, and PCT/US02/34284, filed Oct. 25, 2002, incorporated by reference herein in its entirety.

FIELD OF THE INVENTION

This invention relates generally to electronic circuits known as equalizers. More particularly, this invention relates to a Very Large Scale Integration (VLSI) architecture for implementation of the equalizer for communication systems, where the equalizer comprises a plurality of like processing elements (PEs) that are coupled together in a systolic-type of architecture.

BACKGROUND OF THE INVENTION

A communications system equalizer is a circuit used in a receiver to compensate a received signal for losses and distortion experienced in a communication path between a transmitter of the signal and the receiver. In RF communication systems, such as cellular telephone systems, conventional practice could construct the equalizer circuit using discrete components or, more recently, using a suitably programmed digital signal processor (DSP). In this approach the DSP is normally not dedicated to performing only the equalizer function, but more typically is responsible for the execution of a number of other signal processing tasks as well. As a result, as data rates continue to increase it has been found that the DSP capacity, and especially the lack of available DSP capacity, has created a problem. The increase in data rates also increases the equalizer algorithm complexity, and thus requires higher DSP processing performance. Simply using a faster and higher powered DSP also creates problems, as this approach requires a significant number of skilled engineers, and a large amount of time, resources and risk, to migrate the existing DSP-executed software applications to a new DSP platform. In addition, faster DSPs generally consume more current, which can be a significant disadvantage in battery powered devices such as cellular telephones and personal communicators. This situation has created a need to transfer the DSP equalizer software functionality to hardware.

An Application Specific Integrated Circuit (ASIC) hardware implementation provides more processing power and is more area efficient than a DSP solution. Thus, there is a need for an equalizer implemented in hardware that is power and area efficient. There is also a need for a scalable equalizer. However, ASIC technology does not provide a quick design implementation, and is limited in its ability to be changed to accommodate revisions to a design or specification.

Trellis searching architectures have been studied for GSM (Global System for Mobile Communications) systems using serial (centralized) and parallel (distributed) approaches, as evidenced by A. Lloyd, M. Reynolds and Y. Shah, “VLSI Architectures for Viterbi Decoding,” IEE Colloquium on VLSI Implementations for Second Generation Digital Cordless and Mobile Telecommunication Systems, 1990, pp. 6/1–6/7, hereafter referred to as Lloyd and incorporated by reference herein in its entirety. The scalability of a systolic approach, as discussed by P. Gulak and T. Kailath, “Locally Connected VLSI Architectures for the Viterbi Algorithm,” IEEE Journal on Selected Areas in Communications, Vol. 6, No. 3, April 1988, pp. 527–537, hereafter referred to as Gulak and incorporated by reference herein in its entirety, has led to a single type of processing element (PE) which can be used as the basis for either a serial or parallel approach to Viterbi decoding. The PE is amenable to the pipelining of computational elements, see P. Pirsch, “Architectures for Digital Signal Processing,” John Wiley, New York, 1996, hereafter referred to as Pirsch and incorporated by reference herein in its entirety, which allows multiple operations to occur on each clock edge.

In Lloyd a locally connected array is shown (FIG. 3), and these authors state that their VLSI architecture is applicable to both Viterbi decoding and Viterbi equalization.

Also of interest is the publication by Chakraborty, Mrityunjoy and Suraiya Pervin, “A systolic array realization of the adaptive decision feedback equalizer”, Signal Processing, Vol. 80 (2000), No. 12, pp. 2633–2640.

A need exists, as yet unfulfilled prior to this invention, to provide a scalable channel equalizer that is constructed and operated as a parallel, systolic array of like processing elements that exhibits, among other features, reduced state sequence estimation, decision feedback and a global search function for metric normalization and soft value determination in a serial parallel processor structure.

SUMMARY OF THE INVENTION

An embodiment of this invention provides a method that equalizes a phase modulated signal, e.g., an 8-PSK (Phase Shift Keying) signal, that is distorted by multipath during passage through a communication channel. Another embodiment of this invention provides apparatus, embodied as circuitry, that equalizes a phase modulated signal, e.g., an 8-PSK signal, that is distorted by passage through a communication channel. The preferred embodiment provides a systolic equalizer for realizing the Viterbi equalization of an 8-PSK signal.

A presently preferred embodiment of the systolic equalizer of this invention employs one Processing Element (PE) for each state of N states of the equalizer function. In the preferred embodiment each PE has a local memory (LM) and an associated Record (R). In the presently preferred embodiment the equalizer processes one burst of samples of a received signal at a time, and a channel estimator provides to the equalizer an estimate of the channel, convolved with a pre-filter response. The equalizer constructs a look up table (LUT) of the product of individual channel taps with individual constellation points of the 8-PSK constellation. An initialization function initializes the N Records and the N local memories, and a first sample is input to all of the PEs in parallel. Each PE then processes a Record and passes the processed Record to a neighboring PE. This occurs N times per sample so that each Record visits every PE. The Records and Local Memories are modified as required by each PE. After all PEs have processed all of the Records, a soft output calculation unit obtains data from a terminal (e.g., a rightmost) PE of the array of PEs and produces soft output for a previous symbol. The process can then be repeated for the next sample until the entire burst of samples is processed. Other embodiments of a systolic array equalizer are also provided.

An equalizer in accordance with this invention includes a logical arrangement of a plurality of instantiations of locally coupled processing elements that form a systolic array for processing in common received signal samples having distortion induced by passage through a communications channel. The equalizer outputs soft values for input to a decoder, the soft values representing an approximation of maximum a posteriori (MAP) probabilities. A trellis search procedure is employed to reconstruct estimates of a received signal sequence based on a reduced number of states. The reduced number of states is represented by a plurality of groups determined by partitioning a symbol constellation such that there are fewer groups than possible symbols.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other features, aspects, and advantages of embodiments of this invention are made more apparent in the following description of presently preferred embodiments of the invention, when read in conjunction with the accompanying drawings. It is to be understood, however, that the drawings are provided solely for the purposes of illustration, and are not to be viewed as a definition of the limits of the invention.

FIG. 1 is a diagram showing partitioning of an 8-PSK constellation using J=4, where in the trellis state, instead of a symbol being one of eight, it becomes a symbol from one of four groups (reduced states), where one of the symbols of the group is ignored as being very unlikely.

FIGS. 2A, 2B and 2C, collectively referred to as FIG. 2, are a block diagram of a single PE and associated components, a block diagram of a linear array of PEs, and a block diagram illustrative of the Systolic Equalizer Architecture in accordance with an embodiment of the invention, where there is one PE per state.

FIG. 3A is a block diagram illustrative of PE structure in accordance with an embodiment of the invention, while FIG. 3B shows the critical processing path of the PE of FIG. 3A.

FIGS. 4A and 4B are block diagrams illustrative of the Local Compares block of FIG. 3A.

FIGS. 5A and 5B are block diagrams illustrative of the Global Compares block of FIG. 3A.

FIG. 6 is a representation of the systolic architecture for N=16.

FIG. 7 is a diagram illustrative of a mapping of a four state architecture to one PE in accordance with an embodiment of the present invention.

FIG. 8 is a diagram showing how one PE can implement the four state equalizer.

FIG. 9A shows an example of a trellis for a 16-state equalizer that is not fully connected, although it is nicely grouped, while FIG. 9B shows a block diagram of the 16 state systolic equalizer, as also shown in FIG. 6.

FIG. 10 illustrates a receiver that includes an equalizer that is constructed and operated in accordance with this invention.

FIGS. 11A, 11B, 11C and 11D, collectively referred to as FIG. 11, are useful in explaining the operation of the equalizer algorithm, where FIG. 11A shows the overall algorithm flow and a definition of the notation, FIG. 11B shows a set of presently preferred equalizer equations, FIG. 11C shows a four state trellis and the maximization of the cumulative metric, and FIG. 11D shows the update procedure for the PE Local Memory and Channel LUT.

FIG. 12 is a logic flow diagram in accordance with a method of this invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As employed herein a systolic processor is one that “pumps” or transfers data from one processing element to another. Systolic processor arrays, per se, are well known in the art. Systolic processor arrays have been used to increase pipelined computing capability, and therefore the computing speed, of various types of signal processors. Systolic processor arrays have been used for matrix multiplication, as well as for handwriting recognition and image processing tasks, among others. The systolic array may have the form of a two-dimensional linear array, or a three-dimensional area array.

An embodiment of the present invention defines a method and an apparatus that provide for the detection of an 8-PSK signal distorted by passage through a communication channel. A preferred, but not limiting, embodiment provides a systolic equalizer for Viterbi equalization of an 8-PSK signal for EDGE (Enhanced Data rate for Global Evolution) RF communication devices (an evolution of the GSM time division, multiple access (TDMA) mobile communications system). In the EDGE system the channel coder is convolutional, and uses multiple data rates and puncturing patterns. RMS delay spreads on the order of a symbol period are common, and equalization is required to meet the minimum performance requirements of the EDGE standard. Some applicable equalization algorithms are discussed by W. Gerstacker and R. Schober, “Equalization for EDGE mobile communications”, IEE Electronics Letters, Jan. 2, 2000, Vol. 36, No. 2, pp. 189–191, incorporated by reference herein in its entirety.

Referring first to FIG. 10, there is illustrated a simplified block diagram of a receiver 1, which in the preferred but not limiting embodiment is an EDGE receiver, that includes an equalizer 10 that is constructed and operated in accordance with this invention. An input RF signal is first converted to a digital signal by analog-to-digital conversion (ADC) to form the input signal to a receiver filter 2. The filtered signal z is then applied to a channel estimator 3, as well as to a prefilter 4. The channel estimator 3 forms an estimate of the RF communications channel h₀ using a midamble training sequence of the EDGE burst, and a prefilter function (f) is calculated. The prefiltered signal samples y, where y equals z convolved with f, are passed to the equalizer 10 along with the impulse response (overall channel estimate) h, where h equals h₀ convolved with f. The signal y is equalized in equalizer block 10, and the resulting soft values, soft, are input to a channel decoder 5, such as a Viterbi decoder.
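The two convolutions that feed the equalizer 10 (y equals z convolved with f, and h equals h₀ convolved with f) may be illustrated by the following minimal Python sketch. The function name convolve and all numeric tap and sample values are hypothetical and chosen only for illustration; they are not taken from this description.

```python
def convolve(a, b):
    """Full linear convolution of two sequences (complex values allowed)."""
    out = [0j] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Illustrative values only (assumptions, not from the source text).
h0 = [1 + 0j, 0.5 + 0j]      # channel estimate from the channel estimator
f = [1 + 0j, -0.5 + 0j]      # prefilter taps
z = [1 + 0j, 0j, 0j]         # output of the receiver filter (a unit impulse here)

h = convolve(h0, f)          # overall channel estimate passed to the equalizer
y = convolve(z, f)           # prefiltered samples passed to the equalizer
```

In this toy case the prefilter cancels the middle tap of the overall response, so h has three taps of which only the first and last are nonzero.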

The theory of operation of the equalizer 10 of this invention is summarized as follows.

Reconfigurable hardware may be used to create the equalizer 10 in accordance with the present invention. Locally connected processing elements (PEs) enable small circuit area realizations, and the preferred systolic architecture of the equalizer 10 is applicable for searching a trellis using the Viterbi algorithm. Realizations of 2, 4 and 16 state reduced state soft output equalizers have been evaluated by the inventors. Evaluations of the central processing tasks show that these tasks may be implemented with, by example, 30–200 kgates (thousand gates) operated at 25–35 MHz, depending on the number of states in the equalizer.

The following discussion explains the algorithm which the systolic equalizer architecture implements. It is not intended to be read as a derivation of the algorithm.

The formulation given in Lloyd for the Viterbi algorithm is repeated here.

For each sample received (y) at time (τ)
   For each state (k)
      For each state (j) that can lead to this state
         j′ = argmin_(j)(C_(j) + D_(j,k)(y))
      C_(k) = C_(j′) + D_(j′,k)(y)
      For all τ′ < τ
         P_(k,τ′) = P_(j′,τ′)
      P_(k,τ) = T_(j′,k)

The variables used here are defined as follows: C is the cumulative metric; D is the branch metric; and P is the symbol history. For the equalizer problem D=∥y−r∥², where y is the received sample and r is a reference sample calculated based on path history and the transition to the current state being evaluated.
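The formulation above may be sketched in Python as a single add-compare-select update per received sample. This is a minimal sketch under stated assumptions: the function viterbi_step, the two-state toy trellis, the two-tap channel and the binary symbol alphabet are all hypothetical and serve only to exercise the update; the minimum is used for the compare step, matching the argmin in the formulation.

```python
def viterbi_step(C, P, y, predecessors, reference, symbol):
    """One update for a received sample y.

    C[j]: cumulative metric of state j; P[j]: symbol history of state j.
    predecessors[k]: states j that can lead to state k.
    reference(j, k): reference sample r for the transition j -> k.
    symbol(j, k): entry appended to the history for that transition.
    """
    new_C, new_P = {}, {}
    for k in predecessors:
        # Branch metric D = |y - r|^2; keep the predecessor with the smallest sum.
        best = min(predecessors[k],
                   key=lambda j: C[j] + abs(y - reference(j, k)) ** 2)
        new_C[k] = C[best] + abs(y - reference(best, k)) ** 2
        new_P[k] = P[best] + [symbol(best, k)]
    return new_C, new_P

# Hypothetical toy setup: binary symbols, channel memory of one symbol.
sym = {0: -1.0, 1: 1.0}
h = [1.0, 0.5]

def reference(j, k):
    return h[0] * sym[k] + h[1] * sym[j]

def symbol(j, k):
    return k

predecessors = {0: [0, 1], 1: [0, 1]}
C = {0: 0.0, 1: 0.0}
P = {0: [], 1: []}
for y in [0.5, -0.5, 0.5]:        # noiseless samples for the sequence 1, 0, 1
    C, P = viterbi_step(C, P, y, predecessors, reference, symbol)
best_state = min(C, key=C.get)
```

After the three samples the surviving history of best_state reproduces the transmitted sequence, since the samples were generated without noise.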

A basic function of the Viterbi algorithm is to determine what sequence of symbols has been received. One additional determination that is needed for equalization and decoding, such as turbo decoding, is to associate a probability value with each received element in the sequence. This is referred to in the art as soft output, which is an approximation of the maximum a posteriori (MAP) probability of the bit in question (see H. Van Trees, “Detection, Estimation, and Modulation Theory,” John Wiley, New York, 1968 for a discussion of exact MAP probabilities, and see Koch and Baier for approximate MAP probabilities).

A most straightforward way to determine the received sequence is to test for correlation of the received data with all possible received sequences. However, for a GSM 8-PSK EDGE burst of duration 148 symbols, there are too many possible sequences to test for in a typical real-time communications application. Because the sequences all contain identical sub-sequences, the searching task can be reduced significantly by taking an initial sub-sequence and extending it by one sample. The resulting sequences are then extended by one sample. For each extension, a distance measure (C, above) is updated to show how close the candidate sequence is to the received data. When two extensions make use of the same subsequence of length equal to the channel memory, only one sequence is retained (associated with j′ above); the others are discarded. This is denoted as Maximum Likelihood Sequence Estimation (MLSE), as described by G. Ungerboeck, “Adaptive Maximum-Likelihood Receiver for Carrier-Modulated Data-Transmission Systems”, IEEE Trans. on Comm., Vol. 22, No. 5, May 1974, pp. 624–636, incorporated by reference herein in its entirety. In this manner the number of candidate sequences is kept manageable. In order to administer these extensions and comparisons, the state (j, above) of the channel memory is defined. Any candidate sequence passing through a given state represents a hypothesis that the channel memory was composed of the symbols listed in the state description at that sample time (τ, above).

A convenient way to represent the states is to identify them as permutations of the possible transmitted symbol sequences. The length of the permutation is equal to the channel memory length less one, L−1. For example, for 8-PSK (M=8) and L=3, the state label would be represented by 2*3=6 bits, and the resulting number of states is 64. As an example, assume that the symbol 010₂ were sent, followed by symbol 110₂, and then finally symbol b₂b₁b₀. Then state 010110₂ should exhibit a very high metric in the search process in extending sequences using possible values of b₂b₁b₀.

A commonly used variation of MLSE is to only search over sequences of a limited length. The intersymbol interference (ISI) energy which remains is accounted for by canceling it using per survivor DFE (decision feedback equalization). The full channel length is L_(h), while the MLSE length is denoted L. The number of DFE channel taps is then L_(h)−L. This arrangement is sometimes referred to as Decision Feedback Sequence Estimation (DFSE).

Note that the filter taps used for reconstruction can be fixed at the beginning of a time slot, and need not be adaptive during the time slot. This is but one distinction between the teachings of this invention and the systolic decision feedback equalizer (DFE) described in the above-referenced Chakraborty et al. publication: “A systolic array realization of the adaptive decision feedback equalizer”, Signal Processing, Vol. 80 (2000), No. 12, pp. 2633–2640. Another distinction relates to the fact that the DFE need not employ a trellis search.

It is useful to reduce the number of states that are implemented. This can be done by labeling the states as permutations of groups (g) that the symbols could have been chosen from, as in FIG. 1. These groups are determined by partitioning the symbol constellation (see, for example, M. Eyuboglu and S. Qureshi, “Reduced-State Sequence Estimation with Set Partitioning and Decision Feedback”, IEEE Trans. on Comm., Vol. 36, No. 1, January 1988, pp. 13–20, incorporated by reference herein in its entirety). Because there are fewer groups than symbols, the number of permutations and states is thereby reduced. For example, if L=3 and each group contains two symbols (J₁=J₂=4 using Qureshi notation), so that there are 4 groups, then the number of states is 4^(L-1)=16, as opposed to 64. The performance loss is minimized by maximizing the distance within the constellation between symbols in the same group, making it unlikely that a received sample will be associated with an incorrect element of a group rather than the correct element. The number of symbols has not been reduced, however, so it is still necessary to extend every sequence using every one of the M symbols in the constellation.
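The state counts quoted above (64 full states and a 6-bit label for M=8 and L=3, versus 16 states for four groups) can be checked with the short Python sketch below. The helper names are hypothetical. The grouping shown, which pairs each point of a unit-circle 8-PSK constellation with the point diametrically opposite, is an assumption made only to illustrate a partition with maximal intra-group distance; it is not asserted to be the partition of FIG. 1.

```python
import cmath
import math

def num_states(n_labels, L):
    """Number of trellis states when each of the L-1 memory positions takes n_labels values."""
    return n_labels ** (L - 1)

def label_bits(n_labels, L):
    """Bits needed for a state label (n_labels assumed to be a power of two)."""
    return round(math.log2(n_labels)) * (L - 1)

M, L, J = 8, 3, 4
full_states = num_states(M, L)        # symbol-labeled states
reduced_states = num_states(J, L)     # group-labeled states

# Assumed partition: symbol index m belongs to group m mod J (antipodal pairs).
points = [cmath.exp(2j * cmath.pi * m / M) for m in range(M)]
groups = [[m for m in range(M) if m % J == g] for g in range(J)]
intra_group_distance = abs(points[groups[0][0]] - points[groups[0][1]])
```

With this assumed partition the two members of each group are separated by the diameter of the constellation, the largest separation available on the unit circle.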

It is also desirable to obtain soft output corresponding to every coded bit sent by the channel coder of the transmitter. This is done by taking a difference of cumulative metrics, which is related to the likelihood ratio of the bit of interest being a one versus a zero (reference may be had in this regard to W. Koch and A. Baier, “Optimum and Sub-Optimum Detection of Coded Data Disturbed by Time-Varying Intersymbol Interference,” IEEE Globecom 1990, pp. 1679–1684, incorporated by reference herein in its entirety). For the presently preferred embodiment of the equalizer 10 it is desirable that the equalizer 10 search for the best cumulative metric associated with a transmitted one versus the best metric for a transmitted zero, take the difference, and then present the difference as the soft output soft. For M=8, there are three coded bits per received sample y and thus three soft values. These soft values are calculated after all of the energy corresponding to a given symbol has entered the equalizer 10. This results in producing the soft values soft associated with symbol s_(k-(L-1)), after receiving symbol s_(k). Because of the operation of reducing the number of states, the possibility may arise that there are no cumulative metrics observed by the equalizer 10 with an associated path history of a ‘1’, for example, at history position k−(L−1), bit i (for M=8, there are 3 bits at each symbol position, i=0, 1 or 2). In this case, the soft value is substituted. This substitution preferably reflects the probability of the bit in question.

The equations above can be extended as follows in order to obtain softvalues:

For each sample (y) received
   For each state (k)
       For each state (j) and transition (m) that can lead to this state
         For each bit position (n) in the symbol
             b = history_(j)<<(log₂M)*(L−1)+n
             C_0(n) = max_(b=0) C(j,m)
             C_1(n) = max_(b=1) C(j,m)
     For each bit position (n) in the symbol
       if (soft(τ −_τ_(m),n) exists)
           soft(τ −_τ_(m),n) = C_0(τ,n)−C_1(τ,n)
       else
           soft(τ −_τ_(m),n) = substitute(τ −_τ_(m),n).
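The per-bit soft value step may be illustrated by the following minimal Python sketch. It assumes that the candidate cumulative metrics have already been gathered together with the bits found at the history position of interest; the function soft_values, the candidate list and the substitute callback are hypothetical names introduced for illustration, and the fallback value is arbitrary.

```python
def soft_values(candidates, n_bits, substitute):
    """Return one soft value per bit position.

    candidates: list of (cumulative_metric, symbol_bits) pairs, where symbol_bits
        is the tuple of bits at the history position being resolved.
    substitute(n): hypothetical fallback used when no metric exists for one of
        the two bit hypotheses at position n.
    """
    soft = []
    for n in range(n_bits):
        c0 = [c for c, bits in candidates if bits[n] == 0]
        c1 = [c for c, bits in candidates if bits[n] == 1]
        if c0 and c1:
            # Difference of the best metric for a 0 and the best metric for a 1.
            soft.append(max(c0) - max(c1))
        else:
            soft.append(substitute(n))
    return soft

# Illustrative metrics only; bit 1 has no surviving '0' hypothesis.
candidates = [(5.0, (0, 1, 1)), (3.0, (1, 1, 0)), (4.0, (0, 1, 0))]
result = soft_values(candidates, 3, lambda n: -99.0)
```

In this example the middle bit position has no surviving path with a 0, so the substitute value is returned for it while the other two positions use the metric difference.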

The original descriptive equations are sufficiently general to include the possibility of reduced states with the addition of one internal loop. The additional internal loop would read “For each transition (m) which can lead to this state from state (j).”

The approximation of the MAP soft values may be developed using, for example, the approach of W. Koch and A. Baier, “Optimum and Sub-Optimum Detection of Coded Data Disturbed by Time-Varying Intersymbol Interference,” IEEE Globecom 1990, pp. 1679–1684. In Equation 15 of Koch and Baier it is pointed out that: −ln(e^(−x)+e^(−y))=min(x,y)−ln(1+e^(−|y−x|)), so that the log of the sum may be approximated by using a minimum operation. When a departure from exactness is made, many alternatives are possible. In the presently preferred embodiment of this invention the natural log (ln) and exponential (exp) functions have been eliminated in favor of using the above-shown max( ) operations, i.e., C_0(n)=max_(b=0) C(j,m), and C_1(n)=max_(b=1) C(j,m). Note in this regard that max( ) and min( ) are the same operation with the exception of the sign of the value being tested.
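The log-sum identity can be checked numerically with the short Python sketch below. The sketch assumes the identity is read with negative exponents, that is −ln(e^(−x)+e^(−y)) = min(x,y) − ln(1+e^(−|y−x|)), which is the form that holds numerically; the function names are hypothetical. Dropping the correction term leaves the plain minimum, and the error of that approximation is never larger than ln 2.

```python
import math

def neg_log_sum_exact(x, y):
    """-ln(exp(-x) + exp(-y)) evaluated directly."""
    return -math.log(math.exp(-x) + math.exp(-y))

def neg_log_sum_split(x, y):
    """Same quantity written as a minimum plus a correction term."""
    return min(x, y) - math.log(1.0 + math.exp(-abs(y - x)))

def neg_log_sum_approx(x, y):
    """Approximation that keeps only the minimum operation."""
    return min(x, y)

# Error of the min-only approximation for a pair of illustrative metrics.
error = neg_log_sum_approx(1.0, 3.0) - neg_log_sum_exact(1.0, 3.0)
```

The error is largest when the two arguments are equal, where it equals ln 2, and shrinks quickly as the arguments move apart.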

For the purposes of this invention, in a presently preferred embodiment the soft assignments: (soft(τ−_τ_(m),n)=C_0(τ,n)−C_1(τ,n), soft(τ−_τ_(m),n)=substitute(τ−_τ_(m),n)) are considered to be or to represent an approximation of the maximum a posteriori (MAP) probabilities.

At this point it will also be useful to provide a short discussion related to the existence of cumulative values for the soft metric calculation. The soft values are calculated as the difference between two cumulative values. Those two cumulative values are approximations of the probability of a “1” and the probability of a “0” (see, for example, Equation 14 of Koch and Baier). In a reduced state sequence estimator (RSSE), one of these probabilities may not have been stored by the algorithm, i.e., no value exists for the algorithm to use. This situation can occur when one of these event probabilities (for example, the event “0”) is so low that all of the surviving paths only contain information about the other event (in this example, the event “1”). This lack of information is not important to performance because it only occurs in unambiguous cases. However, one must still supply the decoder with an appropriate soft value (in this example, the soft value would strongly favor a “1” in the decoding process). There are a number of ways to substitute an appropriate soft value. One technique is referred to as DFsoft, and is explained in commonly assigned U.S. patent application Ser. No. 09/928,927, filed Aug. 13, 2001, “Soft Bit Computation for Reduced State Equalizers”, Andrei Malkov, Heikki Berg, Pekka Kaasila, Kiran Kuchi and Jan Olivier, incorporated by reference herein in its entirety. In the DFsoft approach, the reference sample, r, of FIG. 3A is computed for each of the constellation points, x (see FIGS. 1 and 11A). The intersymbol interference due to earlier symbols is approximated by using the path history with the largest metric. In an 8-PSK embodiment, each constellation point x has a one-to-one mapping to a 3-bit sequence. For the bit in question, the Euclidean distance between the nearest symbol with a “0” in that position is subtracted from the Euclidean distance of the closest x with a “1” in that bit position. This value is substituted for the missing soft value. In the example at hand, the symbol with a “0” in that position would be quite distant, and a metric favoring “1” strongly would be created.
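The distance-difference substitution described above may be sketched in Python as follows. This is a rough illustration under several assumptions and is not the DFsoft procedure of the referenced application: the function name dfsoft_substitute is hypothetical, the constellation is an ideal unit-circle 8-PSK, the bit label of each point is simply the binary form of its index (not the EDGE mapping), and the interference from earlier symbols is passed in as a single precomputed term isi.

```python
import cmath
import math

def dfsoft_substitute(y, h0, isi, bit_n, M=8):
    """Return (distance to nearest point with bit 1) - (distance to nearest point with bit 0).

    y: received sample; h0: first channel tap; isi: interference estimate
    from the best path history (assumed to be supplied by the caller).
    """
    nearest = {0: math.inf, 1: math.inf}
    for idx in range(M):
        x = cmath.exp(2j * cmath.pi * idx / M)   # constellation point
        r = h0 * x + isi                         # reference sample for this point
        bit = (idx >> bit_n) & 1                 # assumed natural-binary labeling
        nearest[bit] = min(nearest[bit], abs(y - r))
    return nearest[1] - nearest[0]

# Sample lying exactly on point 0 (bit 0 of its label is 0).
value_on_point0 = dfsoft_substitute(1 + 0j, 1.0, 0.0, 0)
# Sample lying exactly on point 1 (bit 0 of its label is 1).
value_on_point1 = dfsoft_substitute(cmath.exp(2j * cmath.pi / 8), 1.0, 0.0, 0)
```

The sign of the returned value follows which bit hypothesis lies closer to the sample, and its magnitude grows with the separation of the two nearest candidates.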

The following description is intended to aid the reader in visualizing how the algorithm is partitioned over various computational elements. This is an important step in the hardware design, as it identifies the required computational units and how the data moves between the computational units. The pseudocode provided below illustrates in detail the algorithmic steps.

In the presently preferred embodiment of this invention the equalizer 10 is embodied in an integrated circuit, more specifically in an ASIC. As such, a number of practical considerations arise related to gate count, silicon area, clock rate, power consumption and flexibility (among others). The presently preferred embodiment is influenced at least in part by the VLSI Viterbi algorithm proposals by Lloyd and Gulak, as modified so as to provide a moderate clock rate and a minimization of integrated circuit area. Embodiments of the invention also preferably provide a scalable architecture that may be configured programmatically to realize various numbers of states at run time.

Three essentially generic Viterbi decoder architectures were studied in Lloyd. These were: (a) a purely serial processor, much like a DSP; (b) the serial processor with some additional numeric components; and (c) a parallel processor, with one processing element (PE) per trellis state (shown in FIG. 3 of Lloyd).

Additional architectures are presented by Gulak. A most interesting one for extension to the teachings of this invention is the linear array. This is a systolic array in which there is one PE for every state, and where data copies are circulated between the PEs. When describing the instant invention, use is made of the term “Record” for the data copies that are circulated between the PEs (from the Pascal programming language). For a fully connected trellis, all of the PEs are simultaneously busy operating on Records (i.e., on data copies).

FIG. 2A illustrates a parallel processing element (PE 12) having an associated Local Memory (LM) 13, a Channel LUT 14 and an input Record 18. FIG. 2B shows a linear array of the PEs 12 (12A–12D), assumed in this case to have the LM 13 internalized. In sequence, the input sample y arrives, the Records 18A–18D are processed in parallel by the PEs 12A–12D, the Records 18 shift forward (from left to right in this embodiment) and are processed again, and after all PEs 12 have processed all Records 18, the soft values (s) are output. FIG. 2C is another view of the equalizer 10 with data circulating between the linear array of PEs 12 (N PEs, PE₀ to PE_(N-1), designated for convenience as 12A–12D), in a manner somewhat similar to that shown in FIG. 3 of Lloyd and in FIG. 4 of Gulak. However, this invention further provides the Channel LUT 14 coupled to each PE 12 (Channel LUTs 14A–14D, respectively), and employs a soft value substitution unit 16, coupled to the right-most PE 12D of the linear array, in regards to the presently preferred embodiment of the channel equalizer 10. The circulated data (Records 18) include parameters: Cumulative, History, smallest_Cum, cum0 and cum1, as discussed below. Each PE 12A–12D is assumed in this embodiment to include the Local PE Memory (LM) 13A–13D, respectively. Common inputs to the equalizer 10 include a synchronization signal (synch), a clock signal (f_(logic)), and the input samples y of the signal received from the channel that is to be equalized.

The basic steps in the operation of the equalizer 10 are as follows:

-   Obtain the channel estimate h and calculate the Channel LUT 14;
-   Initialize the records (R) and PEs 12;
-   For each received data sample:
    -   normalize the cumulative metric in PE Local Memory (LM) 13;
    -   Repeat N times
        -   Each PE accepts the record at its input
        -   The record is processed, PE LM 13 is updated and some record fields may be overwritten.
        -   shift the record from the PE input to the PE output
-   compute the soft output values
-   wait for the next burst
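The “Repeat N times” portion of the steps above can be visualized with the following minimal Python sketch, which only tracks which Record sits at which PE and performs no metric computation. The function name circulate is hypothetical, and the wrap-around from the last PE back to the first is an assumption made so that every Record visits every PE.

```python
def circulate(n_pe):
    """Simulate N records shifting through N PEs; return the PEs visited by each record."""
    records = list(range(n_pe))              # records[i] is at the input of PE i
    visited = {rec: [] for rec in records}
    for _ in range(n_pe):                    # "Repeat N times"
        for pe, rec in enumerate(records):
            visited[rec].append(pe)          # PE processes the record at its input
        # Shift every record one PE to the right; the last record wraps to PE 0.
        records = [records[-1]] + records[:-1]
    return visited

visits = circulate(4)
```

For four PEs each Record is processed exactly once by each PE, and in every step all four PEs are busy, which is the property the systolic arrangement relies on.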

The Channel LUT 14 stores all possible products between the channel taps and constellation points x. For example, if the total estimated channel length, L_(h),=6 and M=8, then there are 48 such complex products stored in the Channel LUT 14. The element responsible for loading of the Channel LUT 14 is shown in FIG. 3A as the unit 15.
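A table of this kind may be sketched in Python as shown below. The function name build_channel_lut, the ideal unit-circle 8-PSK points and the six tap values are assumptions made for illustration; the sketch only demonstrates that six taps and eight constellation points give 48 stored complex products.

```python
import cmath

def build_channel_lut(h, M=8):
    """Return lut[tap][symbol] = h[tap] * x[symbol] for an M-PSK constellation."""
    points = [cmath.exp(2j * cmath.pi * m / M) for m in range(M)]
    return [[tap * x for x in points] for tap in h]

# Illustrative channel estimate with L_h = 6 taps (values are arbitrary).
h = [1.0 + 0.0j, 0.6 - 0.2j, 0.3 + 0.1j, 0.1 + 0.0j, 0.05j, 0.02 + 0.0j]
lut = build_channel_lut(h)
n_products = sum(len(row) for row in lut)
```

Precomputing the products once per burst means that forming a reference sample later requires only table reads and additions.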

The synch signal is used to initialize and activate the PEs 12 for each record. Also the synch signal controls the computation of the soft values from the record R_(N-1) after processing of the sample is completed. The f_(logic) clock represents the highest clock frequency required for operation of the PEs 12 so as to complete their computations within the allocated interval of time. As will be shown below, a suitable clock frequency may be about 30 MHz.

FIGS. 11A, 11B, 11C and 11D are also useful in explaining the operation of the equalizer 10. FIG. 11A shows the overall algorithm flow, the transition k from state i to state j given an input sample y and impulse response h for a constellation point x_(k) associated with transition k, as well as a definition of the notation. FIG. 11B shows a set of presently preferred equalizer equations. FIG. 11C shows a four state trellis and the maximization of the cumulative metric assuming constellation groups g₀ through g₃, as in FIG. 1. FIG. 11D shows the update procedure for the PE 12 Local Memory 13 and Channel LUT 14, based on the input Record 18.

The internal components of a PE 12 are shown in FIG. 3A. The PE 12 includes, in addition to the aforementioned LM 13, a Local Finite State Machine (FSM) 20, a local compares block 22, a global compares block 24, a magnitude squares block 26, a switch (SW) 28, and three summation nodes 30, 32 and 34. The following functions are carried out inside the PE 12 under the control of the local FSM 20. Before the processing of a sample begins, the cumulative metric (cum) value stored in the LM 13 is normalized. Also, the record R, designated Input Record 18, is updated based on the data in LM 13. Then repetitive processing of records begins as was explained above with regard to FIG. 2. The processing of the Input Record 18, such as one obtained from the adjacent PE 12 of FIG. 2, is carried out by an examination of each possible trellis transition from the state represented by R to the state represented by the PE 12. The path history associated with the trellis transition under consideration is used to address the Channel LUT 14. Output values from the Channel LUT 14 are successively summed in node 34 until the reference value, r, has been formed. At this time SW 28 is closed, and r is subtracted from the sample, y, in node 32, to obtain the difference d. The magnitude square of d is taken in block 26 to obtain the branch metric, b. The branch metric b is added in node 30 to the R_cum value from the record, R, to obtain a new cumulative metric (cum) for consideration. The new cum is passed to the Local Compares block 22 which may make changes to both the record R 18 and the LM 13. At the same time the Global Compares block 24 operates and may make different changes to R 18 and the LM 13.

The operation of the PE 12 may be summarized in a general sense as follows:

For each Record 18 arriving at the PE 12 input:

-   -   For each parallel transition of the trellis (illustrated in FIG. 11C), the PE 12:
        -   (a) calculates the reference, r, where in the presently preferred, but non-limiting embodiment the Channel LUT 14 is used for this purpose;
        -   (b) computes a difference d=y−r;
        -   (c) squares the difference value, b=|d|²;
        -   (d) adds this branch metric to the cumulative metric of the Record 18; and
        -   (e) examines the path history in the Record 18 and updates the Local Memory 13 path history and the Local Memory 13 metrics as required;
    -   carries out the global compares and updates the Record 18 and Local Memory 13 as needed.
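Steps (a) through (d) above can be illustrated with a minimal Python sketch. It is not the hardware datapath: the function name `branch_update`, the argument layout, and the direct computation of the tap-times-symbol products (which the preferred embodiment reads from the Channel LUT 14) are assumptions made only for illustration.

```python
def branch_update(y, r_cum, history, hypothesis, h):
    """Return the candidate cumulative metric for one trellis transition.

    y          -- received complex sample
    r_cum      -- cumulative metric carried in the Record
    history    -- previously decided symbols, most recent first (assumed layout)
    hypothesis -- constellation point tested for the current symbol
    h          -- channel tap estimates, h[0] applies to the current symbol
    """
    # (a) reference r: sum of tap * symbol products (LUT contents in hardware)
    r = h[0] * hypothesis
    for tap, past_symbol in zip(h[1:], history):
        r += tap * past_symbol
    d = y - r              # (b) difference
    b = abs(d) ** 2        # (c) branch metric
    return r_cum + b       # (d) new cumulative metric for consideration


# Usage: a two-tap channel with one symbol of history.
cum = branch_update(y=0.5 + 0j, r_cum=2.0, history=[1 + 0j],
                    hypothesis=-1 + 0j, h=[1.0, 0.5])
```

Step (e), the comparison against the Local Memory 13, is deliberately left out of this sketch.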

The Local Compares block 22 is shown in FIGS. 4A and 4B. The function of the Local Compares block 22 is to pose the following inquiries, and to overwrite data as needed according to the results:

-   Is R_cum (Record Cumulative) greater or less than LM 13 cum (cum>?LM.cum)
-   cum<? LM.new_smallest_cum.
-   cum>? LM.new_cum0_bit 0, 1, or 2
-   cum>? LM.new_cum1_bit 0, 1, or 2.

If any of these tests is affirmative, as indicated by comparator 23, shown as comparators 23A–23E in FIG. 4B, then the previous value in the LM 13 is overwritten under control of the overwrite control block 22A. The cum0 and cum1 tests are contingent on a value existing there. Note in FIG. 4B that a decode unit 22B can be used to decode the history and existences outputs of the LM 13, and the decoded output is used as a selection input to three multiplexors (MUXes) 22C–22E thereby enabling, for this non-limiting embodiment, three bit-specific tests via comparators 23C, 23D and 23E.
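The local compare inquiries can be sketched in Python as follows. The dictionary layout, the function name `local_compares`, and the rule that a missing cum0/cum1 entry is simply filled by the first candidate are assumptions of this sketch; the comparison directions follow the inquiries listed above.

```python
def local_compares(cum, history_bits, lm):
    """Apply the local compare tests for one candidate metric (illustrative only).

    lm is a dict with keys 'cum', 'history', 'new_smallest_cum',
    'new_cum0', 'new_cum1' (3 entries each) and 'cum0_exists', 'cum1_exists'.
    history_bits holds the three bits of the symbol being decided.
    """
    if cum > lm["cum"]:                       # cum >? LM.cum
        lm["cum"] = cum
        lm["history"] = list(history_bits)
    if cum < lm["new_smallest_cum"]:          # cum <? LM.new_smallest_cum
        lm["new_smallest_cum"] = cum
    for ib, bit in enumerate(history_bits):   # three bit-specific tests
        key, flag = ("new_cum1", "cum1_exists") if bit else ("new_cum0", "cum0_exists")
        # Assumption: when no value exists yet, the candidate is written directly.
        if not lm[flag][ib] or cum > lm[key][ib]:
            lm[key][ib] = cum
            lm[flag][ib] = 1
    return lm


def fresh_lm(nsmall=-1000, nbig=1000):
    """Local memory initialised as in the burst initialisation step."""
    return {"cum": nsmall, "history": [0, 0, 0], "new_smallest_cum": nbig,
            "new_cum0": [nsmall] * 3, "new_cum1": [nsmall] * 3,
            "cum0_exists": [0, 0, 0], "cum1_exists": [0, 0, 0]}
```

Only the comparator that matches each history bit is exercised, which is the role the decode unit 22B and the MUXes 22C–22E play in FIG. 4B.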

The global compares block 24 is shown in FIGS. 5A and 5B. The purpose of the global compares block 24 is to search through the PEs 12 and determine with comparators 25A–25D, shown collectively as comparator 25 in FIG. 5A, several globally large or small values. The smallest metric is located for normalization purposes, while the best cum0 and cum1 metrics are located to compute the soft values (s). This occurs in cooperation with a decode block 24B, driven by LM 13 and R 18 existences values, and MUXes 24C and 24D. After the processing of a sample, the Local FSM 20 directs the placement of the LM.new_smallest_cum value into the LM.old_smallest_cum. During the operation of the Global Compares block 24 this value moves into the Record 18 being processed if it is the smallest in value. As each PE 12 processes successive Records 18, the LM_smallest_cum value takes on the minimum value over all PEs 12. It is only necessary to activate the Global Compares cum0/cum1 operations for R_(N-1), which visits each PE 12 once.

Further with regard to the operation of the Global Compares block 24, after each sample period the extreme metric values need to be found for the purposes of normalization and computing the soft values. These values should be the extremes as searched over all of the PEs 12. The values which are to be searched are not, however, set until all of the records have been processed. A direct solution would create a centralized global searcher which could access all of the PEs 12. However, this approach would require a star shaped bus or a shared bus to be incorporated into the systolic array design, and would greatly diminish the benefits of the algorithm realization on the presently preferred locally-connected structure. It is thus preferred to obtain the desired values after a one sample delay using a serial search. The one sample delay has no impact on performance, as the Records 18 are already circulating in a serial fashion in order to compute the metrics and path histories. The search is accomplished by adding the cum_old type variables to the Records 18 and Local Memories 13, and using the Global Compare 24 function. N values are searched to achieve the function. This can be achieved by only activating the Global Compare 24 function in the rightmost PE 12 (all N Records 18 pass through every PE 12 in the preferred embodiment). As such, it can be appreciated that one may save power and gates by providing the Global Compares 24 block in only the right-most PE 12. Alternatively, if provided in each PE 12 the Global Compares 24 could be powered down if not in use, or until use is required.
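The serial search can be illustrated with a small Python simulation of the variant in which every PE performs the smallest-metric exchange as the Records circulate. The function name `serial_min_search`, the ring indexing, and the initial Record value `big` are assumptions of the sketch, not details taken from the figures.

```python
def serial_min_search(lm_smallest, big=1000):
    """Circulate N records past N PEs and exchange the running minimum.

    lm_smallest -- per-PE smallest cumulative metric from the previous sample
    Returns (lm, rec): per-PE and per-Record values after one full circulation.
    """
    n = len(lm_smallest)
    lm = list(lm_smallest)
    rec = [big] * n                      # Records start with a large value
    for step in range(n):                # one shift of the ring per step
        for pe in range(n):
            r = (pe - step) % n          # Record currently sitting at this PE
            smallest = min(rec[r], lm[pe])
            rec[r] = lm[pe] = smallest   # whichever side is larger is overwritten
    return lm, rec


lm, rec = serial_min_search([7, 3, 9, 5])
```

After N shifts every Record has visited every PE, so both the Records and the Local Memories hold the minimum over all PEs without any shared or star-shaped bus.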

In certain embodiments, for example in FIG. 9B, a final comparison would be done in block 16 of FIG. 9B for the soft values. The normalization may be done locally in each sub-array (e.g., PE00, PE01, PE02, PE03), or in block 16 and communicated to the PE subarrays. Normalization per se in a Viterbi-type algorithm is a well-known step to reduce the number of bits needed to represent the cumulative metrics.

The foregoing discussion concerns both two state and four state equalizers 10. For a 16 state equalizer 10, the trellis is not fully connected as shown in FIG. 9A. This means that the components of the systolic equalizer 10 are not all busy if every state record circulates throughout the linear array. However, symmetry conditions exist in the 16 state trellis such that it can be realized using four banks of four element PEs 12. The following discussion pertains to the special requirements for the 16 state systolic equalizer 10, followed by a description of how to map many PEs 12 to a single PE using a higher clock rate. FIG. 9B is a block diagram of a systolic implementation of the 16 state equalizer 10, where there are four instances of the four PE linear array shown in FIGS. 2B and 2C, each operating in parallel to provide soft values (s) from the soft value calculation block 16. The Channel LUTs 14 are not shown in order to simplify the drawing.

Examination of the trellis for the 16 state equalizer 10 of FIG. 9A shows that it may also be implemented with a linear array as shown in FIG. 6. The 16 state linear array is composed of four linear PE arrays, as shown in FIG. 9B, where each linear array forms one row of the table shown in FIG. 6. During processing of a sample y, a Record 18 exiting from the right-most PE 12 re-enters at the left-most PE 12 in the same row. In this manner the 16 state equalizer 10 uses four times the PEs 12 and four times the Records of the four state equalizer shown in FIGS. 2B and 2C. After the sample y is processed, the soft values at the rightmost PEs are gathered by the comparison unit 16 for reduction to a single soft value per bit.

After the sample y has been processed, each PE 12 updates the Record memory adjacent to it (as before, except that the PE 12 label generally does not match the Record label). The Records 18 are then circulated through the entire array so that they are available to a PE where they are needed. Also, an additional element of delay in finding the smallest cum is preferably used.

The four PEs 12A–12D of FIGS. 2B and 2C may be mapped to a single PE operating at a higher clock rate by adding memory (see Pirsch in this regard). If one assumes that the LM 13 is removed from each PE 12, and that delays are added between PEs 12, one obtains the embodiment of FIG. 7, where the circles between PEs 12 represent delay elements.

In this case it is desired that PE0 processes R0, PE1 processes R1, PE2 processes R2 and PE3 processes R3, followed by shifting the records which results in PE0 processing R3, PE1 processing R0, PE2 processing R1 and PE3 processing R2, followed by shifting the records and PE0 processing R2, PE1 processing R3, PE2 processing R0 and PE3 processing R1, followed by shifting the records and PE0 processing R1, PE1 processing R2, PE2 processing R3 and PE3 processing R0, and then outputting the delayed soft values from block 16. This can be accomplished by shifting the LMs 13 (which actually define the identity of the PE 12) against the proper Records 18. The result is the equalizer embodiment shown in FIG. 8.
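The record schedule just described is a cyclic rotation, which a short Python sketch makes explicit. The function name `rotation_schedule` and the list-of-lists output format are assumptions chosen for illustration.

```python
def rotation_schedule(n=4):
    """schedule[step][pe] is the index of the Record processed by PE `pe` at `step`."""
    return [[(pe - step) % n for pe in range(n)] for step in range(n)]


for step, row in enumerate(rotation_schedule()):
    print(step, ["PE%d:R%d" % (pe, rec) for pe, rec in enumerate(row)])
```

Each Record meets each PE exactly once over the four steps, which is why a single physical PE can reproduce the same sequence by stepping its Local Memories against the Records.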

According to the schedule sequence discussed above, after R0–R3 have been processed once, the Records 18 are shifted backwards against the direction of the arrow one space so that LM0 lines up with R3 at the beginning of the processing of the next four record times. Since one PE 12 in the four PE embodiment of the four state equalizer 10 processed each record four times, when using one PE 12 as in FIG. 8 there are a total of 16 record processing events.

A presently preferred algorithm for use in the systolic equalizer 10 architecture is now described.

The following description includes variable declarations and pseudocode for the burst by burst operation of the systolic equalizer 10. All of the operations shown in the pseudocode are implemented by the hardware elements discussed above.

Definitions:

Constants

N       Number of States                         2, 4, or 16
NP      Number of PEs                            2, 4 or 16 (or 1)
NT      Number of Member Symbols in Partition    4, 2 or 2
LH      Length of History                        6
L       Number of MLSE taps                      2, 2, or 3
M       Modulation Level                         2 or 8
Ns      Number of samples processed in a burst   3 + 58 + 26 + 58 + 3 = 148
Nsmall  Init. value                              −1000 or the like
Nbig    Init. value                              1000 or the like

Variable Names

Name    Meaning
R       Record Unit 18
LM      Local Memory 13
h       channel estimate
x       constellation points
C_LUT   Channel Lookup Table 14

Operation Names

Mc      Complex Multiply
Mr      Real Multiply
Ac      Complex Addition
Ar      Real Addition
T       Transfer a value to a register

Record Unit 18 Structure

-   -   Cum 16 bits
    -   History 15 bits
    -   old_cum0 (3 values)
    -   old_cum1 (3 values)
    -   smallest_cum 16 bits
    -   cum0_exists (3 bits)
    -   cum1_exists (3 bits)

Each record 18 has a need for nine, 16 bit words and two, 3 bit words.There is one record per state.

Local Memory 13 Structure

-   -   Cum 16 bits
    -   History 15 bits
    -   new_smallest_cum 16 bits
    -   old_smallest_cum 16 bits
    -   new_cum0 (3 values)
    -   new_cum1 (3 values)
    -   old_cum0 (3 values)
    -   old_cum1 (3 values)
    -   cum0_existences (3 bits)
    -   cum1_existences (3 bits)

Each LM 13 employs 24, 16 bit words and six, 3 bit words. There is one LM 13 per state.

Memory

Altogether, this embodiment of the systolic equalizer 10 uses 33, 16 bit words and eight, 3 bit words per state.

N States    16 bit words    3 bit words
2           66              16
4           132             32
16          528             128
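The memory totals scale linearly with the number of states, as the following Python sketch shows. The per-state figures are the ones stated in the text; the function and constant names are assumptions.

```python
WORDS_16BIT_PER_STATE = 33   # 16 bit words per state, as stated in the text
WORDS_3BIT_PER_STATE = 8     # 3 bit words per state, as stated in the text


def memory_words(n_states):
    """Return (16 bit words, 3 bit words) needed for an N state equalizer."""
    return (WORDS_16BIT_PER_STATE * n_states, WORDS_3BIT_PER_STATE * n_states)


for n in (2, 4, 16):
    print(n, memory_words(n))
```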

// Partition holds the complex symbol values for this state. Selected from J1 partitions of 8-PSK

For Each Burst:
  // initialize registers in Records (R) and Local Memories (LM)
  for (i=0;i<N;i++) {
    R(i).Cum = Nsmall;
    R(i).History = 0;
    R(i).old_cum0 = Nsmall;
    R(i).old_cum1 = Nsmall;
    R(i).smallest_cum = Nbig;
    R(i).cum0_exists = (0,0,0);
    R(i).cum1_exists = (0,0,0);                       // opcount: 7 N T
  }
  for (i=0;i<N;i++) {
    LM(i).Cum = Nsmall;
    LM(i).History = 0;
    LM(i).new_smallest_cum = Nbig;
    LM(i).old_smallest_cum = 0;
    LM(i).new_cum0 = Nsmall;
    LM(i).new_cum1 = Nsmall;
    LM(i).old_cum0 = Nsmall;
    LM(i).old_cum1 = Nsmall;                          // opcount: 8 N T
  }
  // generate the Channel_LUT 14 using unit 15
  for (i=0;i<M;i++) {
    for (j=0;j<LH;j++) {
      C_LUT(i,j) = h(j)*x(i);                         // opcount: M Lh Mc
    }
  }
  // for each symbol
  for (is=0;is<Ns;is++) { // it is implied that all op counts in this loop are * Ns
    // normalize cumulative metrics
    for (i=0;i<N;i++) {
      LM(i).cum −= LM(i).old_smallest_cum;            // N Ar
    }
    // compute soft substitute values; may use the method explained by A. Viterbi,
    // "CDMA Principles of Spread Spectrum Communication", Addison-Wesley, 1995,
    // eqtn 4.51, or any other suitable method, including DFsoft
    // r_dfe is a reference signal based on the jth constellation symbol at time is-(L−1)
    // d is the square of the distance from the received sample to r_dfe
    for (j=0;j<M;j++) {
      d(j) = |y(is-(L−1))-r_dfe(is-(L−1),j)|²;
    }
    // best0 is the value of d associated with a symbol with a 0 in position k
    // best1 is the value of d associated with a symbol with a 1 in position k
    for (k=0;k<Log2M;k++) {
      best0(k) = min0(d,k);
      best1(k) = min1(d,k);
      sub_soft(k) = best1(k)−best0(k);
    }
    for (i=0;i<Log2M;i++) {
      if (R(N−1).old_cum0_exists(i) && R(N−1).old_cum1_exists(i)) {
        soft(i) = R(N−1).old_cum0(i) − R(N−1).old_cum1(i);
      }
      else {
        soft(i) = sub_soft(i);
      }
    }
    // transfer some Local Memory 13 elements into Record 18 and initialize existence variables
    for (i=0;i<N;i++) {
      for (ib=0;ib<Log2M;ib++) {
        LM(i).old_cum0(ib) = LM(i).new_cum0(ib);
        LM(i).old_cum1(ib) = LM(i).new_cum1(ib);
      }                                               // 6 N T
      R(i).cum = LM(i).cum;
      R(i).history = LM(i).history;
      R(i).old_smallest_cum = Nbig;
      LM(i).old_smallest_cum = LM(i).new_smallest_cum;
      LM(i).new_cum0_exists = (0,0,0);
      LM(i).new_cum1_exists = (0,0,0);
      LM(i).cum = Nsmall;
      LM(i).new_smallest_cum = Nbig;                  // 8 N T
    }
    for (ir=0;ir<N;ir++) { // for each record
      // below is in parallel if NP = N; operate each PE
      for (ip=0;ip<N;ip++) {
        for (it=0;it<NT;it++) { // for each parallel transition
          rM = C_LUT(0,LM(ip).Partition(it));         // N*N*Nt*T
          rD = 0+j0;
          for (ic=0;ic<LH−1;ic++) {
            product = C_LUT(ic+1,R(ir).History(ic+1)); // N*N*Nt*(Lh−1)*T
            rD += product;                            // N*N*Nt*(Lh−2)*Ac
          }
          r = rD + rM;                                // N*N*Nt* 1 Ar
          d = y(is) − r;                              // N*N*Nt* 1 Ar
          b = |d|²;                                   // N*N*Nt*(2Mr+1Ar)
          c = R(ir).cum + b;                          // N*N*Nt* 1 Ar
          cum = trunc(c);                             // N*N*Nt* 1 T
          if (cum > LM(ip).cum) {                     // N*N*Nt* 1 Ar
            LM(ip).cum = cum;                         // N*N*Nt* 2 T
            LM(ip).History = R(ir).History;
          }
          if (cum < LM(ip).new_smallest_cum) {
            LM(ip).new_smallest_cum = cum;
          }
          if (R(ir).smallest_cum<LM(ip).old_smallest_cum) { // N*N*Nt* 1 Ar
            LM(ip).old_smallest_cum = R(ir).smallest_cum;
          } else {
            R(ir).smallest_cum = LM(ip).old_smallest_cum;
          }
          for (ib=0;ib<Log2M;ib++) {
            bit = R(ir).History >> bit_delay & 1;
            if (!bit) {
              if (cum>LM(ip).new_cum0(ib)) {          // 3* N*N*Nt* 1 Ar (Log2M=3 for 8-PSK)
                LM(ip).new_cum0(ib) = cum;
                LM(ip).new_cum0_exists(ib) = 1;       // 3 * 2 T
              }
            }
            else {
              if (cum>LM(ip).new_cum1(ib)) {          // 3* N*N*Nt* 1 Ar
                LM(ip).new_cum1(ib) = cum;
                LM(ip).new_cum1_exists(ib) = 1;
              }
            }
            if (!R(ir).old_cum0_exists(ib) && LM(ip).old_cum0_exists(ib)) {
              R(ir).old_cum0(ib) = LM(ip).old_cum0(ib);
              R(ir).old_cum0_exists(ib) = 1;          // this will not happen often
            }
            if (R(ir).old_cum0(ib)<LM(ip).old_cum0(ib) && R(ir).old_cum0_exists(ib) && LM(ip).old_cum0_exists(ib)) { // N*N*Nt* 1 Ar
              R(ir).old_cum0(ib) = LM(ip).old_cum0(ib); // 3 * T
            }
            if (!R(ir).old_cum1_exists(ib) && LM(ip).old_cum1_exists(ib)) {
              R(ir).old_cum1(ib) = LM(ip).old_cum1(ib);
              R(ir).old_cum1_exists(ib) = 1;          // same for cum1, won't happen often
            }
            if (R(ir).old_cum1(ib)<LM(ip).old_cum1(ib) && R(ir).old_cum1_exists(ib) && LM(ip).old_cum1_exists(ib)) { // N*N*Nt* 1 Ar
              R(ir).old_cum1(ib) = LM(ip).old_cum1(ib); // 3 * T
            }
          }
        }
      }
    }
  }
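The soft substitute values in the pseudocode (best0, best1 and sub_soft) can be illustrated in Python. The sketch assumes that bit k of a symbol is bit k of its index j in natural binary; the actual bit-to-symbol mapping of the modulation is not given in this passage, so that labelling and the function name `substitute_soft` are assumptions.

```python
def substitute_soft(d, bits_per_symbol=3):
    """Per-bit substitute soft values from squared distances d[j], one per symbol.

    Assumes bit k of symbol j is (j >> k) & 1, which is an illustrative labelling.
    """
    soft = []
    for k in range(bits_per_symbol):
        best0 = min(dj for j, dj in enumerate(d) if not (j >> k) & 1)  # min0(d, k)
        best1 = min(dj for j, dj in enumerate(d) if (j >> k) & 1)      # min1(d, k)
        soft.append(best1 - best0)
    return soft


# Usage: symbol 0 is closest, so every bit leans towards 0 (positive values here).
print(substitute_soft([1, 20, 30, 40, 50, 60, 70, 80]))
```

These values are used only when the trellis search has not produced both a cum0 and a cum1 candidate for a bit.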

FIG. 12 is a logic flow diagram in accordance with a method of this invention, as described above in reference to the pseudocode example. At Block A a new burst is received from the channel, and at Block B the Channel LUT 14 is initialized with the product of the channel tap estimates and the constellation points. At Block C the LMs 13 and Records 18 are initialized for each state. Block D represents an outer loop of Ns times for each received sample y from the prefilter 4. Block E represents a first inner loop of N, where N is the number of states. At Block F certain LM 13 information is transferred to the Record 18. Block G represents a second inner loop of Nt, the number of parallel transitions. At Block H the method uses the Record History, the current state and transition value to sum the reference r, using the output from the Channel LUT 14 (summation element 34 in FIG. 3A). At Block I the difference d is computed as y−r (difference element 32 in FIG. 3A), and the branch metric b is then computed in the absolute magnitude squared element 26 of FIG. 3A. In Block J the cumulative metric is computed by adding in summation element 30 of FIG. 3A the originating state cumulative metric from the Record 18 and the branch metric b. At Block K the Local Compares block 22 of FIG. 3A decodes (decoder 22B) the History and soft bit existences from the Local Memory 13, and at Block L the cumulative metric comparisons are performed (using comparators 23A–23E of FIG. 4B). At Block M the method overwrites, as necessary, the cumulative metric and soft bit values, and at Block N the soft bit values are computed, substituting if necessary. At Block O the loop is reentered at the appropriate point, depending on the state of the outer and two inner loop counters.

As the preferred embodiment of the invention is implemented in an ASIC or some other type of integrated circuit, a consideration is now made of the gate count and timing. The gate count follows from the hardware architecture. The timing analysis may use both the hardware and the pseudocode to evaluate the critical path for determination of the minimum clock rate required.

The gate counts for the basic components needed are provided for reference purposes only, and are exemplary of one suitable embodiment.

n_input bit    Adder-real    Multiplier-real    Adder-Complex    Mult-Complex
10             260           870                520              4000
11             286           1045               572              4752
12             312           1236               624              5568
13             338           1443               676              6448
14             364           1666               728              7392
15             390           1905               780              8400
16             416           2160               832              9472
17             442           2431               884              10608
18             468           2718               936              11808
19             494           3021               988              13072
20             520           3340               1040             14400

Each device is assumed to have a latched input and output. A real adder requires 26n gates and a real multiplier requires 8n²+7n gates, where n is the input word length.
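The two stated formulas can be checked against the tabulated gate counts with a short Python sketch. The real adder and real multiplier formulas are from the text; the complex adder (two real adders) and complex multiplier (four real multipliers plus one complex adder) decompositions are inferred from the tabulated values rather than stated, and the function names are assumptions.

```python
def real_adder_gates(n):
    return 26 * n                      # stated in the text


def real_multiplier_gates(n):
    return 8 * n * n + 7 * n           # stated in the text


def complex_adder_gates(n):
    # Inferred from the table: two real adders.
    return 2 * real_adder_gates(n)


def complex_multiplier_gates(n):
    # Inferred from the table: four real multipliers plus one complex adder.
    return 4 * real_multiplier_gates(n) + complex_adder_gates(n)


for n in (10, 16, 20):
    print(n, real_adder_gates(n), real_multiplier_gates(n),
          complex_adder_gates(n), complex_multiplier_gates(n))
```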

Using the gate counts one may estimate the number of gates for a single PE 12, excluding control logic.

-   r; 1 complex adder
-   d=y−r; 1 complex adder
-   b=|d|²; 2 real multipliers and 1 real adder
-   c=c+b; 1 real adder
-   local compares; 1 real adder
-   global compares; 1 real adder
-   memory
    -   1 record and 1 LM per PE require a total of 25 16 bit words
    -   1 LUT per PE requires 96 16 bit words

PE 12 Gate Count

function (all are 16 bits)         Number per PE    gates
complex adder                      2                1664
real multiplier                    2                4320
real adder                         4                1664
LM (16 bit words - latch)          9                576
R (16 bit words - latch)           16               1024
LUT (16 bit words - SRAM)          96               2304
Total per PE with dedicated LUT                     11,452 (approx. 12,000)

Latch memory is counted at 16 transistors per bit, 64 gates per 16 bit word. SRAM memory is counted at 6 transistors per bit, 24 equivalent gates per 16 bit word. The gate counts shown include redundant registers, as the gate counts for each adder and multiplier include individual latches at the input and output.

Per equalizer, 1 complex multiplier, assume 10 bit inputs, is needed for construction of the LUT 14. This adds 4000 gates to the overall total. Assume C_LUT is calculated once per burst and then copied in to each of N local lookup tables, one per PE 12. In this way, it is not necessary to arbitrate access to the data therein. The number of C_LUT registers could be reduced to 96 in all cases by introducing arbitration between the PEs 12 and the Channel_LUT table 14.

Gate Count Totals

                       2 state,    2 state,    4 state,    4 state,    16 state,    16 state,
                       number      gates       number      gates       number       gates
10 bit complex mult    1           4,000       1           4,000       1            4,000
number of PEs          2           24,000      4           48,000      16           192,000
number of LUTs         2                       4                       16
Total gates                        28,000                  52,000                   196,000
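The totals follow from one shared complex multiplier plus the rounded per-PE figure. A minimal Python sketch, with assumed names, reproduces them:

```python
LUT_MULTIPLIER_GATES = 4000     # one 10 bit complex multiplier per equalizer
GATES_PER_PE = 12000            # rounded per-PE figure, dedicated LUT included


def total_gates(n_states):
    """Approximate gate total with one PE per state."""
    return LUT_MULTIPLIER_GATES + n_states * GATES_PER_PE


for n in (2, 4, 16):
    print(n, total_gates(n))
```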

A purpose of the timing analysis is to discover the critical path of the algorithm shown in the pseudocode when implemented on the systolic equalizer 10. Reference can also be made to FIG. 3B, which illustrates the PE 12 of FIG. 3A so as to emphasize the pipelined portion of the architecture and the portion where the transitions are processed serially.

Once per burst:

-   1. Initialize the 48 complex elements of Channel_LUT 14.
-   2. Initialize the LM 13 and R 18 of each PE 12

for each receive sample {
  Normalize LM.cum, compute soft substitution.
  transfer LM into R. - 9 words, e.g., nine clocks (ticks)
  Repeat N times { // cycling of records
    Repeat Nt times { // parallel transitions; assume pipelining of this Nt loop
      Repeat Nh times { // sum up the reference
        pipelined { Read from C_LUT, Clock Adder1 }
      } // Nh; give this Nh+1 = 7 ticks
      subtract from sample using Adder2; 1 tick
      take magnitude square; 2 ticks
      add R.cum; 1 tick
      cum has been produced for this transition
      5 similar compares: {
        smallest_cum,cum,cum0/1(3)
      }
      potentially overwrite history; 2 ticks
    } // Nt
  } // N
  compute soft outputs; clock ticks neglected
}

Timing critical path                                          2 state          4 state          16 state
formation of r (pipelined)                                    7 ticks          7 ticks          7 ticks
d = y−r; (pipelined)                                          1                1                1
b = |d|²; (pipelined)                                         2                2                2
3 local compares cum0/1                                       6                6                6
min cum compare                                               2                2                2
max cum compare                                               2                2                2
possible history overwrite                                    2                2                2
ticks per transition                                          10 + 12 Nt       10 + 12 Nt       10 + 12 Nt
transitions per record, Nt                                    4                2                2
records per PE                                                2                4                4
extra ticks to position records for next sample (at least)    0                0                16
extra ticks to load LM into R before next sample              9                9                9
ticks per sample                                              117              145              161
samples per burst                                             116              116              116
burst period (microseconds)                                   577              577              577
minimum required clock rate (MHz)                             approx. 24 MHz   approx. 30 MHz   approx. 33 MHz
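The last rows of the timing table can be reproduced by dividing the ticks needed per burst by the burst period. In this Python sketch the ticks per sample are taken directly from the table, and the function name and defaults are assumptions.

```python
import math


def min_clock_mhz(ticks_per_sample, samples_per_burst=116, burst_period_us=577.0):
    """Clock ticks required per microsecond, i.e. the minimum clock rate in MHz."""
    return ticks_per_sample * samples_per_burst / burst_period_us


for states, ticks in ((2, 117), (4, 145), (16, 161)):
    rate = min_clock_mhz(ticks)
    print("%2d state: %.2f MHz, rounded up to %d MHz" % (states, rate, math.ceil(rate)))
```

Rounding each result up to the next whole MHz gives the approximate 24, 30 and 33 MHz figures in the table.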

It would be possible to schedule unrelated comparisons to overlap. For instance, if the first transition has a cum0 value for bit 0, and the second transition has a cum1 value for bit 0, then these can be updated and compared to Local Memory 13 in parallel. An overwrite of Local Memory 13 by the first transition's value of cum0, bit 0, will not collide with a comparison of that register by the second transition, since it has no value there. Allowing for these kinds of scheduling can reduce the critical path.

Based on the foregoing description those skilled in the art should appreciate that the disclosed systolic equalizer 10 provides a number of advantages over conventional equalizers. It is again noted that while Lloyd discuss a locally connected array (FIG. 3), and state that their application is applicable to both Viterbi decoding and Viterbi equalization, they do not disclose the scalable channel equalizer 10 that is constructed and operated as a parallel, systolic array of like processing elements 12 that exhibits, among other features, reduced state sequence estimation, decision feedback and the global search function (Global Compares block 24) for metric normalization and soft value determination. In the presently preferred embodiment of the equalizer 10 the sorting of soft values can be accomplished by multiple passes through the systolic array.

In addition, a 16 state equalizer can be realized as four, four PE 12 linear arrays, and by shifting the Records 18 as described above. This invention also provides for cycling the LM 13 against the Records 18, interleaved with shifts of the Records 18 every four clocks to provide a four state equalizer using but one PE 12. Thus, it should be appreciated that this invention provides an equalizer 10 that contains a logical arrangement of a plurality of instantiations of the locally coupled, possibly identical processing elements 12 that form the systolic array. The arrangement may be viewed as being logical in the sense that, as examples, a 16 state equalizer can be realized with but four physical PEs 12, and a four state equalizer can be realized with but one PE 12.

The scalability made possible by the teachings of this invention also enables using one four PE systolic array to realize a two state equalizer, such as by powering off or otherwise disabling two of the PEs; a four state equalizer (as shown in FIGS. 2B and 2C); or a 16 state equalizer by cycling the LMs 13 and the Records 18, as shown in FIG. 7.

This invention also provides for using combinatorial logic, as shown in FIG. 4B, for decoding path histories and activating cumulative metric comparators for sorting cumulative metrics so as to ultimately obtain the desired soft bits.

Although described in the context of particular embodiments of the scalable systolic architecture, it should be apparent to those skilled in the art that a number of modifications and various changes to these teachings may occur. Thus, while the invention has been particularly shown and described with respect to one or more preferred embodiments thereof, it will be understood by those skilled in the art that certain modifications or changes, in form and shape, may be made therein without departing from the scope of the invention as set forth above.

1. An equalizer, comprising a logical arrangement of a plurality of instantiations of locally coupled processing elements PEs forming a systolic array for processing in common received signal samples having distortion induced by passage through a communications channel, and outputting soft values for input to a decoder, the soft values representing an approximation of maximum a posteriori (MAP) probabilities, the equalizer further comprising a global compares block for comparing a data entry in records operated on by said plurality of instantiations of locally coupled PEs and selectively overwriting said data entry in one of said records based on the comparing.
2. An equalizer as in claim 1, where a trellis search procedure is employed to reconstruct estimates of a received signal sequence based on a reduced number of states represented by a plurality of groups determined by partitioning a symbol constellation such that there are fewer groups than possible symbols.
3. An equalizer as in claim 2, where the symbol constellation represents one formed by 8-PSK modulation of a transmitted signal sequence.
4. An equalizer as in claim 1, where an effect of a prior symbol is subtracted using a decision feedback mechanism.
5. An equalizer as in claim 1, where said locally coupled processing elements operate in parallel and each comprise an input node for receiving a Record to be processed; a node for coupling to a channel look-up table (Channel LUT) addressed by the Record and storing products of individual channel taps with individual constellation points of a symbol constellation; a local memory (LM); and circuitry for calculating a reference, r, using data stored in the Channel LUT; circuitry for computing a difference d=y−r, where y is one of the received signal samples; circuitry for squaring the difference value, b=|d|² to form a branch metric; circuitry for adding the branch metric to a cumulative metric of the Record and circuitry for examining a path history in the Record and updating the LM path history and the LM metrics as needed, and where at least one processing element further comprises circuitry, operating after all Records are processed, for performing a global metric comparison and updating Records and LMs as needed.
6. An equalizer as in claim 1, where said locally coupled processing elements are coupled together as a linear systolic array of processing elements.
7. An equalizer as in claim 1, where said logical arrangement of the plurality of instantiations of locally coupled processing elements is embodied in one processing element, where for N states the one processing element successively processes N instantiations of a local memory against an input Record, each of which stores data that comprises cumulative metrics.
8. An equalizer as in claim 1, where for N states there are M instantiations of said locally coupled identical processing elements, where M<N, and where delays are inserted between serially coupled processing elements.
9. An equalizer as in claim 1, comprising combinatorial logic for decoding path histories and activating cumulative metric comparators for sorting cumulative metrics so as to obtain the soft values.
10. An equalizer as in claim 1, wherein said equalizer is embodied within an integrated circuit.
11. The equalizer of claim 1 wherein said global compares block is disposed within one of said PEs.
12. The equalizer of claim 1 wherein said global compares block operates to selectively overwrite said data entry when said comparing results in a more extreme metric than said data entry.
13. A method for processing, on a burst by burst basis, received signal samples having distortion induced by passage through a communications channel, comprising: providing a reduced state equalizer comprised of a logical arrangement of a plurality of instantiations of locally coupled processing elements (PEs) forming a systolic array, each PE having an input and an output and an associated Local Memory (LM) and Channel Look-Up Table (LUT); obtaining an estimate of the channel; initializing Records (R) and PEs and calculating the contents of the Channel LUT; for each received signal sample y, normalizing a cumulative metric in each PE LM and repeating N times, accepting a Record at each PE input; processing the Record and updating each PE LM; and shifting the Record from the PE input to the PE output; and outputting soft values for input to a decoder, the soft values representing an approximation of maximum a posteriori (MAP) probabilities.
14. A method as in claim 13, where each Record comprises a cumulative metric, a trellis path history, old cumulative metrics, a smallest cumulative metric, and existing cumulative metrics, and where each LM comprises the cumulative metric, the path history, new and old smallest cumulative metrics, new and old cumulative metrics, and existing cumulative metrics.
15. A method as in claim 13, where processing the Record comprises computing a branch metric, combining the branch metric with an originating state cumulative metric from the Record to form a new cumulative metric, and performing cumulative metric compares for potentially modifying values stored at least in the LM.
16. A method as in claim 13, where said logical arrangement of the plurality of instantiations of locally coupled processing elements is embodied in one processing element, where said one processing element successively processes N instantiations of the LM against an input Record, where N is number of states.
17. A method as in claim 13, where there are M instantiations of said locally coupled processing elements, where M<N, where N is number of states, and where delays are inserted between serially coupled processing elements.
18. A method as in claim 13, where processing the Record comprises decoding a path history and cumulative metric existences and activating cumulative metric comparators for sorting cumulative metrics so as to obtain the soft output values.
19. A method as in claim 13, where processing the Record comprises each PE operating in parallel to: calculate a reference, r, using data stored in the Channel LUT; compute a difference d=y−r; square the difference value, b=|d|² to form a branch metric; add the branch metric to a cumulative metric of the Record; and examine a path history in the Record and update the Local Memory path history and the Local Memory metrics as needed.
20. A method as in claim 19, further comprising, after all Records are processed, performing a global metric comparison and updating Records and Local Memories as needed.
21. A method as in claim 13, where said equalizer performs Viterbi equalization of an 8-PSK, EDGE (Enhanced Data rate for Global Evolution) signal received through an RF communications channel.
22. A method as in claim 13, where a trellis upon which the processing elements operate comprises one of a fully connected trellis or a partially connected trellis.
23. An equalizer, comprising a logical arrangement of a plurality of instantiations of locally coupled processing elements PEs forming a systolic array for processing in common received signal samples having distortion induced by passage through a communications channel, and outputting soft values for input to a decoder, the soft values representing an approximation of maximum a posteriori (MAP) probabilities, wherein each PE is coupled to a channel lookup table LUT of possible products of channel taps and constellation points.
24. The equalizer of claim 23 wherein a separate LUT is associated with each PE.
25. The equalizer of claim 23 wherein the LUT is loaded at a start of each time slot over which the PEs operate on a set of inputs.
26. The equalizer of claim 25 wherein the channel taps used to load the LUT are fixed at the start of each time slot and said LUT is not adaptive during a time slot once loaded.